After Allowance Under 37 C.F.R. 1.312

AMENDMENTS TO THE SPECIFICATION 

In the specification please make the following amendments. 

After paragraph [0033], please make the following changes: 

[0034] The database heretofore discussed is envisioned as being particularly useful as
part of a media recognition system. As such, a method and apparatus for identifying media, in a
number of contexts, is herein disclosed.

[0035] The disclosed invention is capable of recognizing an exogenous sound signal that
is a rendition of a known recording indexed in a database. The exogenous sound signal may be
subjected to distortion and interference, including background noise, talking voices, compression
artifacts, band-limited filtering, transmission dropouts, time warping, and other linear and
nonlinear corruptions of the original signal. The algorithm is capable of identifying the
corresponding original recording from a large database of recordings in time proportional to the
logarithm of the number of entries in the database. Given sufficient computational power the
system can perform the identification in nearly real-time, i.e. as the sound is being sampled, with
a small lag.

[0036] Database Construction 

[0037] The sound database may consist of any collection of recordings, such as speech,
music, advertisements, or sonar signatures.

[0038] Indexing

[0039] In order to index the sound database, each recording in the library is subjected to
landmarking and fingerprinting analysis to generate an index set for each item. Each recording in
the database has a unique index, sound_ID.

25785685.1



PAGE 4/22 * RCVD AT 6/11/2007 1:42:16 PM [Eastern Daylight Time] * SVR:USPTO-EFXRF-2/22 * DNIS:2734022 * CSID: * DURATION (mm-ss):06-38



06/11/07 12:41 FAX 



Application No. 10/087,204 Docket No.: 69323/P004US/10511468

Amendment dated June 11, 2007

[0040] Landmarking

[0041] Each sound recording is landmarked using methods to find distinctive and
reproducible locations within the sound recording. The ideal landmarking algorithm will be able
to mark the same points within a sound recording despite the presence of noise and other linear
and nonlinear distortion. The landmarking method is conceptually independent of the
fingerprinting process, but may be chosen to optimize performance of the latter. Landmarking
results in a list of timepoints {landmark_k} within the sound recording at which fingerprints
should be calculated. A good landmarking scheme marks about 5-10 landmarks per second of
sound recording, of course depending on the amount of activity within the sound recording.

[0042] Power Norms

[0043] A simple landmarking technique is to calculate the instantaneous power at every
timepoint and to select local maxima. One way of doing this is to calculate the envelope by
rectifying and filtering the waveform directly. Another way is to calculate the Hilbert transform
(quadrature) of the signal and use the sum of the magnitudes squared of the Hilbert transform
and the original signal.
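By way of illustration only, the first technique of paragraph [0043] (rectify, filter, select local maxima) might be sketched in Python as follows. The function names, the moving-average filter, the window length, and the synthetic two-burst test signal are all assumptions made for this sketch and do not appear in the specification.

```python
import math

def power_envelope(samples, window=32):
    """Rectify (square) the waveform and smooth it with a moving average."""
    squared = [s * s for s in samples]
    half = window // 2
    env = []
    for i in range(len(squared)):
        lo, hi = max(0, i - half), min(len(squared), i + half + 1)
        env.append(sum(squared[lo:hi]) / (hi - lo))
    return env

def local_maxima(values):
    """Indices that exceed their left neighbour and are >= their right one."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] >= values[i + 1]]

# Hypothetical test signal: two 50-sample tone bursts surrounded by silence.
rate = 1000
signal = [0.0] * rate
for start in (200, 700):
    for n in range(50):
        signal[start + n] = math.sin(2 * math.pi * 100 * n / rate)

landmarks = local_maxima(power_envelope(signal))
```

Every landmark found falls in or immediately around one of the two bursts. A practical system would likely add a minimum spacing between landmarks so that ripple in the envelope does not produce clusters of near-duplicate timepoints.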

[0044] Spectral Lp Norms

[0045] The power norm method of landmarking is especially good for finding transients
in the sound signal. The power norm is actually a special case of the more general Spectral Lp
Norm, where p=2. The general Spectral Lp Norm is calculated at each time along the sound
signal by calculating the spectrum, for example via a Hanning-windowed Fast Fourier Transform
(FFT). The Lp norm for that time slice is then calculated as the sum of the p-th power of the
absolute values of the spectral components, optionally followed by taking the p-th root. As
before, the landmarks are chosen as the local maxima of the resulting values over time.
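A minimal sketch of the Spectral Lp Norm of paragraph [0045] is given below, assuming a direct (slow) discrete Fourier transform in place of an FFT so that the example needs only the Python standard library. The frame length, hop size, and all function names are hypothetical choices for illustration.

```python
import cmath
import math

def hann(n_points):
    """Symmetric Hann (Hanning) window of the given length."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def magnitude_spectrum(frame):
    """Magnitudes of the non-negative-frequency DFT bins of a windowed frame."""
    n_points = len(frame)
    windowed = [x * w for x, w in zip(frame, hann(n_points))]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * n / n_points)
                    for n, x in enumerate(windowed)))
            for k in range(n_points // 2 + 1)]

def spectral_lp_norm(frame, p=2.0, take_root=False):
    """Sum of p-th powers of the spectral magnitudes, optionally p-th rooted."""
    total = sum(m ** p for m in magnitude_spectrum(frame))
    return total ** (1.0 / p) if take_root else total

def lp_norm_track(samples, frame_len=64, hop=32, p=2.0):
    """Lp norm of each overlapping time slice along the signal."""
    return [spectral_lp_norm(samples[i:i + frame_len], p)
            for i in range(0, len(samples) - frame_len + 1, hop)]
```

Landmarks would then be taken at the local maxima of the list returned by `lp_norm_track`. Because the p-th root is monotonic, omitting it does not change where the maxima fall.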

[0046] Multislice Landmarks

[0047] Multi-slice landmarks may be calculated by taking the sum of p-th powers of
absolute values of spectral components over multiple timeslices instead of a single slice. Finding
the local maxima of this extended sum allows optimization of placement of the multislice
fingerprints, described below.

[0048] Fingerprinting 

[0049] The algorithm computes a fingerprint at each landmark timepoint in the recording.
The fingerprint is generally a value or set of values that summarize a set of features in the
recording near the timepoint. In our implementation the fingerprint is a single numerical value
that is a hashed function of multiple features.

[0050] The following are a few possible fingerprint categories.
[0051] Salient Spectral Fingerprints 

[0052] In the neighborhood of each landmark timepoint a frequency analysis is
performed to extract the top several spectral peaks. A simple such fingerprint value is just the
single frequency value of the strongest spectral peak. The use of such a simple peak resulted in
surprisingly good recognition in the presence of noise, but resulted in many false positive
matches due to the non-uniqueness of such a simple scheme. Using fingerprints consisting of the
two or three strongest spectral peaks resulted in fewer false positives, but in some cases created a
susceptibility to noise if the second-strongest spectral peak was not sufficiently strong to
distinguish it from its competitors in the presence of noise; in that case the calculated fingerprint
value would not be sufficiently stable. Despite this, the performance of this case was also good.

[0053] Multislice Fingerprints 

[0054] In order to take advantage of the time-evolution of many sounds a set of
timeslices is determined by adding a set of offsets to a landmark timepoint. At each resulting
timeslice a Salient Spectral Fingerprint is calculated. The resulting set of fingerprint information
is then combined to form one multitone fingerprint. Each such fingerprint is much closer to unique
than the single-time salient spectral fingerprint since it tracks temporal evolution, resulting in
fewer false matches. Our experiments indicate that using two or three timeslices along with the
single strongest spectral peak in each timeslice results in very good performance, even in the
presence of significant noise.
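For illustration, a multislice fingerprint built from the single strongest spectral peak in each of three timeslices might look like the following sketch. The offsets, frame length, unwindowed direct DFT, and 8-bits-per-slice packing are assumptions chosen to keep the example short; the specification does not prescribe them.

```python
import cmath
import math

def strongest_bin(frame):
    """Index of the largest-magnitude DFT bin of one frame, DC excluded."""
    n_points = len(frame)
    best_k, best_mag = 1, -1.0
    for k in range(1, n_points // 2):
        mag = abs(sum(x * cmath.exp(-2j * math.pi * k * n / n_points)
                      for n, x in enumerate(frame)))
        if mag > best_mag:
            best_k, best_mag = k, mag
    return best_k

def multislice_fingerprint(samples, landmark, offsets=(0, 64, 128), frame_len=64):
    """Pack the strongest bin of each timeslice into one integer, 8 bits each."""
    value = 0
    for off in offsets:
        start = landmark + off
        value = (value << 8) | strongest_bin(samples[start:start + frame_len])
    return value
```

With a 64-sample frame the bin index is below 32, so 8 bits per slice is ample. A signal whose dominant bin moves 4, 9, 5 across the three slices yields `(4 << 16) | (9 << 8) | 5`, whereas a single-slice fingerprint would see only the first value.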

[0055] LPC Coefficients 

[0056] In addition to finding the strongest spectral components, there are other spectral
features that can be extracted and used as fingerprints. LPC analysis extracts the linearly
predictable features of a signal, such as spectral peaks, as well as spectral shape. LPC
coefficients of waveform slices anchored at landmark positions can be used as fingerprints by
hashing the quantized LPC coefficients into an index value. LPC is well-known in the art of
digital signal processing.

[0057] Cepstral Coefficients

[0058] Cepstral coefficients characterize signals that are harmonic, such as voices or many musical
instruments. A number of cepstral coefficients may be hashed together into an index and used as a
fingerprint. Cepstral analysis is well-known in the art of digital signal processing.

[0059] Index Set 

[0060] The resulting index set for a given sound recording is a list of pairs (fingerprint,
landmark) of analyzed values. Since the index set is composed simply of pairs of values, it is
possible to use multiple landmarking and fingerprinting schemes simultaneously. For example,
one landmarking/fingerprinting scheme may be good at detecting unique tonal patterns, but poor
at identifying percussion, whereas a different algorithm may have the opposite attributes. Use of
multiple landmarking/fingerprinting strategies results in a more robust and richer range of
recognition performance. Different fingerprinting techniques may be used together by reserving
certain ranges of fingerprint values for certain kinds of fingerprints. For example, in a 32-bit
fingerprint value, the first 3 bits may be used to specify which of 8 fingerprinting schemes the
following 29 bits are encoding.
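The 3-bit/29-bit layout in the example above can be sketched as follows. The function and constant names are hypothetical, and masking an oversized payload down to 29 bits is an assumption of this sketch rather than something the specification states.

```python
SCHEME_BITS = 3
PAYLOAD_BITS = 29
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1

def pack_fingerprint(scheme, payload):
    """Place the scheme number in the top 3 bits of a 32-bit value."""
    if not 0 <= scheme < (1 << SCHEME_BITS):
        raise ValueError("scheme must fit in 3 bits")
    return (scheme << PAYLOAD_BITS) | (payload & PAYLOAD_MASK)

def unpack_fingerprint(value):
    """Split a 32-bit fingerprint into (scheme, payload)."""
    return value >> PAYLOAD_BITS, value & PAYLOAD_MASK
```

Because the scheme occupies the most significant bits, fingerprints from different schemes fall into disjoint value ranges and can share one sorted index without colliding.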

[0061] Searchable Database

[0062] Once the index sets have been processed for each sound recording in the database,
a searchable database is constructed in such a way as to allow fast (log-time) searching. This is
accomplished by constructing a list of triplets (fingerprint, landmark, sound_ID), obtained by
appending the corresponding sound_ID to each doublet from each index set. All such
triplets for all sound recordings are collected into a large index list. In order to optimize the
search process, the list of triplets is then sorted according to the fingerprint. Fast sorting
algorithms are well-known in the art and extensively discussed in D. E. Knuth, "The Art of
Computer Programming, Volume 3: Sorting and Searching," hereby incorporated by reference.
High-performance sorting algorithms can sort the list in N log(N) time, where N is the number of
entries in the list. Once this list is sorted it is further processed by segmenting it such that each
unique fingerprint in the list is collected into a new master index list. Each entry in this master
index list contains a fingerprint value and a pointer to a list of (landmark, sound_ID) pairs.
Rearranging the index list in this way is optional, but saves memory since each fingerprint value
only appears once. It also speeds up the database search since the effective number of entries in
the list is greatly reduced to a list of unique values.
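As an illustrative sketch of paragraph [0062], the master index list might be built as shown below. Representing it as two parallel Python lists (`keys` and `postings`) and passing the index sets as a dictionary keyed by sound_ID are assumptions of this example only.

```python
from itertools import groupby
from operator import itemgetter

def build_master_index(index_sets):
    """index_sets maps sound_ID -> list of (fingerprint, landmark) pairs."""
    # Append the sound_ID to each doublet to form triplets.
    triplets = [(fp, lm, sid)
                for sid, pairs in index_sets.items()
                for fp, lm in pairs]
    triplets.sort(key=itemgetter(0))  # N log(N) sort by fingerprint
    keys, postings = [], []
    # Segment the sorted list so that each unique fingerprint appears once.
    for fp, group in groupby(triplets, key=itemgetter(0)):
        keys.append(fp)
        postings.append([(lm, sid) for _, lm, sid in group])
    return keys, postings
```

Here `keys[i]` is a unique fingerprint value and `postings[i]` plays the role of the pointed-to list of (landmark, sound_ID) pairs.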

[0063] Alternatively, the master index list could also be constructed by inserting each
triplet into a B-tree with non-unique fingerprints hanging off a linked list. Other possibilities
exist for constructing the master index list. The master index list is preferably held in system
memory, such as DRAM, for fast access.

[0064] Recognition System

[0065] Once the master index list has been built it is possible to perform sound
recognition over the database.

[0066] Sound Source

[0067] Exogenous sound is provided from any number of analog or digital sources, such
as a stereo system, television, Compact Disc player, radio broadcast, telephone, mobile phone,
internet stream, or computer file. The sounds may be realtime or offline. They may be from any
kind of environment such as a disco, pub, submarine, answering machine, sound file, stereo,
radio broadcast, or tape recorder. Noise may be present in the sound signal, for example in the
form of background noise, talking voices, etc.

[0068] Input to the Recognition System

[0069] The sound stream is then captured into the recognition system either in realtime or
presented offline, as with a sound file. Real-time sounds may be sampled digitally and sent to the
system by a sampling device such as a microphone, or be stored in a storage device such as an
answering machine, computer file, tape recorder, telephone, mobile phone, radio, etc. The sound
signal may be subjected to further degradation due to limitations of the channel or sound capture
device. Sounds may also be sent to the recognition system via an internet stream, FTP, or as a
file attachment to email.

[0070] Preprocessing 

[0071] Once the sound signal has been converted into digital form it is processed for 
recognition. As with the construction of the master index list, landmarks and fingerprints are 
calculated. In fact, it is advisable to use the very same code that was used for processing the 
sound recording library to do the landmarking and fingerprinting of the exogenous sound input. 
The resulting index set for the exogenous sound sample is also a list of pairs (fingerprint, landmark)
of analyzed values.

[0072] Searching

[0073] Searching is carried out as follows: each fingerprint/landmark pair (fingerprint_k,
landmark_k) in the resulting input sound's index set is processed by searching for fingerprint_k
in the master index list. Fast searching algorithms on an ordered list are well-known in the art
and extensively discussed in Knuth, Volume 3 (ibid), incorporated by reference. If fingerprint_k is
found then the corresponding list of matching (landmark*_j, sound_ID_j) pairs
having the same fingerprint is copied and augmented with landmark_k to form a set of triplets of
the form (landmark_k, landmark*_j, sound_ID_j). This process is repeated for all k
ranging over the input sound's index set, with all the resulting triplets being collected into a
large candidate list.
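A minimal sketch of this search step follows, assuming (hypothetically) that the master index list is held as a sorted list of unique fingerprints, `keys`, with a parallel list `postings` of (landmark*, sound_ID) pairs. Binary search via the standard `bisect` module stands in for whichever log-time search an implementation actually uses.

```python
from bisect import bisect_left

def find_candidates(keys, postings, sample_index_set):
    """Return (landmark_k, landmark*_j, sound_ID_j) triplets for every hit."""
    candidates = []
    for fingerprint_k, landmark_k in sample_index_set:
        pos = bisect_left(keys, fingerprint_k)  # log-time lookup
        if pos < len(keys) and keys[pos] == fingerprint_k:
            # Augment each matching pair with the sample's own landmark.
            for landmark_star, sound_id in postings[pos]:
                candidates.append((landmark_k, landmark_star, sound_id))
    return candidates
```

Fingerprints from the sample that do not occur in the index simply contribute nothing to the candidate list.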

[0074] After the candidate list is compiled it is further processed by segmenting
according to sound_ID. A convenient way of doing this is to sort the candidate list
according to sound_ID, or by insertion into a B-tree. The result of this is a list of candidate
sound_IDs, each of which has a scatter list of pairs of landmark timepoints,
(landmark_k, landmark*_j), with the sound_ID stripped off.
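The segmentation by sound_ID can be sketched with the sorting approach mentioned above; the B-tree alternative is not shown. The function name and the choice of a dictionary for the result are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter

def scatter_lists(candidates):
    """Group (landmark_k, landmark*, sound_ID) triplets by sound_ID.

    Returns a dict mapping each sound_ID to its scatter list of
    (landmark_k, landmark*) pairs, with the sound_ID stripped off.
    """
    ordered = sorted(candidates, key=itemgetter(2))
    return {sid: [(lk, ls) for lk, ls, _ in group]
            for sid, group in groupby(ordered, key=itemgetter(2))}
```

Python's sort is stable, so pairs within each scatter list keep the order in which they were found.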

[0075] Scanning

[0076] The scatter list for each sound_ID is analyzed to determine whether it is a
likely match.

[0077] Thresholding 

[0078] One way to eliminate a large number of candidates is to toss out those having a
small scatter list. Clearly, those having only 1 entry in their scatter lists cannot be matched.

[0079] Alignment 

[0080] A key insight into the matching process is that the time evolution in matching
sounds must follow a linear correspondence, assuming that the timebases on both sides are
steady. This is almost always true unless the sound on one side has been nonlinearly warped
intentionally or subject to defective playback equipment such as a tape deck with a warbling
speed problem. Thus, the matching fingerprints yielding correct landmark pairs (landmark_n,
landmark*_n) in the scatter list of a given sound_ID must have a linear correspondence
of the form

landmark*_n = m*landmark_n + offset

[0081] where m is the slope, and should be near 1; landmark_n is the corresponding
timepoint within the exogenous sound signal; landmark*_n is the corresponding timepoint
within the library sound recording indexed by sound_ID; and offset is the time offset into
the library sound recording corresponding to the beginning of the exogenous sound signal.

[0082] This relationship ties together the true landmark/fingerprint correspondences
between the exogenous sound signal and the correct library sound recording with high
probability, and excludes outlier landmark pairs. Thus, the problem of determining whether there
is a match is reduced to finding a diagonal line with slope near 1 within the scatterplot of the
points in the scatter list.

[0083] There are many ways of finding the diagonal line. A preferred method starts by 
subtracting m*landmark_n from both sides of the above equation.

(landmark*_n - m*landmark_n) = offset

[0084] Assuming that m is approximately 1, we arrive at

(landmark*_n - landmark_n) = offset

[0085] The diagonal-finding problem is then reduced to finding multiple landmark pairs
that cluster near the same offset value. This is accomplished easily by calculating a histogram of
the resulting offset values and searching for the offset bin with the highest number of points.
Since the offset must be positive if the exogenous sound signal is fully contained within the
correct library sound recording, landmark pairs that result in a negative offset are excluded.
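The histogram step can be illustrated with the sketch below. The bin width and the function name are assumptions; a zero offset is kept here, which is a choice of this example, since the text only says negative offsets are excluded.

```python
from collections import Counter

def best_offset(scatter, bin_width=1.0):
    """Return (offset_bin, score) for the fullest histogram bin, or None.

    scatter is a list of (landmark, landmark*) pairs: the timepoint in the
    exogenous sample and the timepoint in the library recording.
    """
    bins = Counter()
    for landmark, landmark_star in scatter:
        offset = landmark_star - landmark  # slope m assumed to be about 1
        if offset < 0:
            continue  # sample would start before the recording does
        bins[int(offset // bin_width)] += 1
    if not bins:
        return None
    return max(bins.items(), key=lambda item: item[1])
```

For example, the scatter list `[(0.0, 10.0), (1.5, 11.5), (3.0, 13.0), (2.0, 40.0), (5.0, 1.0)]` has three pairs agreeing on an offset of 10, one outlier at 38, and one excluded negative offset, so the result is bin 10 with a score of 3. A true offset that lies near a bin edge may split its points across two adjacent bins, which an implementation might counter by also summing neighbouring bins.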

[0086] The winning offset bin of the histogram is noted for each qualifying sound_ID,
and the corresponding score is the number of points in the winning bin. The sound recording
in the candidate list with the highest score is chosen as the winner. The winning sound_ID
is provided to an output means to signal the success of the identification.

[0087] To prevent false identification, a minimum threshold score may be used to gate
the success of the identification process. If no library sound recording meets the minimum
threshold then there is no identification.
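Winner selection with a gating threshold might be sketched as follows. The default threshold of 5 is an arbitrary placeholder, not a value taken from the specification, and the `scores` dictionary is a hypothetical representation of the per-recording winning-bin counts.

```python
def identify(scores, min_score=5):
    """Pick the sound_ID with the highest score, or None if below threshold.

    scores maps sound_ID -> number of points in its winning offset bin.
    """
    if not scores:
        return None
    winner = max(scores, key=scores.get)
    return winner if scores[winner] >= min_score else None
```

With these assumptions, `identify({"a": 12, "b": 3})` returns `"a"`, while `identify({"a": 4})` returns `None` because no recording reaches the threshold.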

[0088] Pipelined Recognition 

[0089] In a real-time system the sound is provided to the recognition system
incrementally over time. In this case it is possible to process the data in chunks and to update the
index set incrementally. Each update period the newly augmented index set is used as above to
retrieve candidate library sound recordings using the searching and scanning steps above. The
advantage of this approach is that if sufficient data has been collected to identify the sound
recording unambiguously then the data acquisition may be terminated and the result may be
announced.
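The incremental loop can be sketched generically as below. The two callables, `extract_pairs` (landmarking and fingerprinting of one chunk) and `score_candidates` (searching and scanning over the accumulated index set), are hypothetical placeholders supplied by the caller, and treating a score at or above `min_score` as an unambiguous identification is an assumption of this sketch.

```python
def recognize_stream(chunks, extract_pairs, score_candidates, min_score=5):
    """Consume chunks until one recording scores at least min_score.

    Returns (sound_ID or None, number of chunks consumed).
    """
    index_set = []
    chunks_used = 0
    for chunk in chunks:
        chunks_used += 1
        # Augment the index set with this chunk's (fingerprint, landmark) pairs.
        index_set.extend(extract_pairs(chunk))
        scores = score_candidates(index_set)
        if scores:
            winner = max(scores, key=scores.get)
            if scores[winner] >= min_score:
                return winner, chunks_used  # stop acquiring data early
    return None, chunks_used
```

Because `chunks` may be any iterable, including a generator fed by a live capture device, returning early leaves the remaining audio unread.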

[0090] Reporting the Result 

[0091] Once the correct sound has been identified, the result is reported. Among the
result-reporting means, this may be done using a computer printout, email, SMS text messaging
to a mobile phone, computer-generated voice annotation over a telephone, or posting of the result to
an internet account which the user can access later.

[0034] [0092] Although various embodiments are specifically illustrated and described
herein, it will be appreciated that modifications and variations of the invention are covered by the 
above teachings and within the purview of the appended claims without departing from the spirit 
and intended scope of the invention. For example, while several of the embodiments depict the 
use of specific data formats and protocols, any formats or protocols may suffice. Moreover, 
while some of the embodiments describe specific embodiments of computers, clients, servers,
etc., other types may be employed by the invention described herein. Furthermore, these 
examples should not be interpreted to limit the modifications and variations of the invention 
covered by the claims but are merely illustrative of possible variations. 